Abstract. The use of anonymity-based infrastructures and anonymisers is a plausible solution
to mitigate privacy problems on the Internet. Tor (short for The onion router) is a popular
low-latency anonymity system that can be installed as an end-user application on a wide range 
of operating systems to redirect the traffic through a series of anonymising proxy circuits. The 
construction of these circuits determines both the latency and the anonymity degree of the Tor 
anonymity system. While some circuit construction strategies lead to delays which are toler- 
ated for activities like Web browsing, they can make the system vulnerable to linking attacks. 
We evaluate in this paper three classical strategies for the construction of Tor circuits, with re- 
spect to their de-anonymisation risk and latency performance. We then develop a new circuit 
selection algorithm that considerably reduces the success probability of linking attacks while 
keeping a good degree of performance. We finally conduct experiments on a real-world Tor 
deployment over PlanetLab. Our experimental results confirm the validity of our strategy and 
its performance increase for Web browsing. 
1 Introduction

Several anonymity designs have been proposed in the literature with the objective of achieving 
anonymity on different network technologies. From simple pseudonyms [1] to complex unstruc- 
tured protocols [2], anonymity solutions can offer either strong anonymity with high latency (useful 
for high latency services, such as email and Usenet messages) or weak anonymity with low-latency 
(useful, for instance, for Web browsing). The most widely-used low-latency solution for traditional 
Internet communications is based on anonymous mixes and onion routing [3]. It is distributed as 
a free software implementation known as Tor (The onion router [4]). It can be installed as an end- 
user application on a wide range of operating systems to redirect the traffic of low-latency services 
with a very acceptable overhead. Tor's objective is to protect the privacy of a sender as well as
the contents of its messages. To do so, it cryptographically transforms those messages and mixes
them via a circuit of routers. The circuit routes the message in an unpredictable way. The content
of each message is encrypted for every router in the circuit, with the objective of achieving
anonymous communication even if a set of routers is compromised by an adversary. Upon reception,
a router decrypts the message using its private key to obtain the following hop and cryptographic
material on the path. This path is defined at the beginning of the process. Only the entity
that creates the circuit knows the complete path to deliver a given message. The last router of the
path, the exit node, decrypts the last layer and delivers an unencrypted version of the message to
its target.

Tor allows the construction of anonymous channels with latency low enough to route traffic for
services like the Web [5]. However, performance may still be affected by the specific strategy
used for the establishment of the channel. In this paper, we address the influence of
circuit construction strategies on the anonymity degree of Tor. We first provide a formal defini- 
tion of the selection of Tor nodes process, of the adversary model targeting the communication 
anonymity of Tor users, and an analytical expression to compute the anonymity degree of the Tor 
infrastructure based on the circuit construction criteria. Based on these definitions, we evaluate 
three classical strategies, with respect to their de-anonymisation risk, and regarding their perfor- 
mance for anonymising Internet traffic. We then present the construction of a new circuit selection 
algorithm that aims at reducing the success probability of linking attacks while providing enough 
performance for low-latency services. A series of experiments, conducted on a real-world Tor de- 
ployment over PlanetLab [6] confirm the validity of the new strategy, and show its superiority over 
the classical ones. 

Paper organisation — Section 2 presents the rationale of our work. Section 3 evaluates the 
anonymity degree of three traditional strategies for the construction of Tor circuits. Section 4 
presents our new strategy. Section 5 evaluates the anonymity degree of our solution. Section 6 
experimentally evaluates the latency performance of each strategy using PlanetLab. Section 7 sur- 
veys related work. Section 8 concludes the paper. 



2 Rationale 



In this section, we introduce the notation, models, and core definitions that are necessary to under- 
stand the rationale of our work. 



2.1 Tor circuit 



Formally, we can describe a connection using the Tor network as follows. First, we define a client 
node s called a client or onion proxy, and a destination server node d which we want to inter- 
connect to exchange data in an anonymous manner. Let N be the set of nodes deployed in the 
Tor network, and $n = |N|$ the cardinality of the set. Let node $e \in N$ denote a specified node,
called the entrance node, and $x \in N$ the exit node. Then, a Tor circuit is a sequence of nodes
$C = (s, e, r_1, r_2, \ldots, r_l, x)$, where $r_i \in N$ is any intermediary node. The nodes $e$, $x$, and $r_i$,
$i \in \{1, \ldots, l\}$, are also known as onion routers. We define the path of a circuit as the set of links
(i.e., network connections) $P = \{a_1, \ldots, a_{l+2}\}$ associated to the Tor circuit, where $a_1 = (s, e)$,
$a_2 = (e, r_1)$, $a_3 = (r_1, r_2)$, ..., $a_{l+1} = (r_{l-1}, r_l)$, $a_{l+2} = (r_l, x)$. The value $|P| = l + 2$ is
called the length of the circuit. A connection using the Tor network is composed of the client and
destination nodes interconnected through a Tor circuit as follows:

\[ s \xrightarrow{a_1} \underbrace{e \xrightarrow{a_2} r_1 \xrightarrow{a_3} r_2 \xrightarrow{a_4} \cdots \xrightarrow{a_l} r_{l-1} \xrightarrow{a_{l+1}} r_l \xrightarrow{a_{l+2}} x}_{\text{Tor network}} \longrightarrow d \]
2.2 Adversary model

The adversary assumed in our work relies on the threat model proposed by Syverson et al. in [7]. 
Such a pragmatic model considers that, regardless of the number of onion routers in a circuit, 
an adversary controlling the entrance and exit nodes would have enough information in order to 
compromise the communication anonymity of a Tor client. Indeed, when both nodes collude, and 
given that the entry node knows the source of the circuit, and the exit node knows the destination, 
they can use traffic analysis to link communication over the same circuit [8]. 

Assuming the model proposed in [7], an adversary who controls $c > 1$ nodes over the $n$
nodes in the Tor network can control an entry node with probability $\frac{c}{n}$, and an exit node with
probability $\frac{c}{n}$. This way, the adversary may de-anonymise the traffic flowing on a controlled
circuit (i.e., a circuit whose entry and exit nodes are controlled by the adversary) with probability
$\left(\frac{c}{n}\right)^2$ if the length of the circuit is greater than two; or $\frac{c(c-1)}{n^2}$ if the length of the circuit is equal to
two (cf. [7] and citations thereof). Adversaries can determine when the nodes under their control
are either entry or exit nodes for the same circuit stream by using attacks such as timing-based attacks
[9], fingerprinting [10], and several other existing attacks.

Let us observe that the aforementioned probability of success assumes that the nodes of a Tor
circuit are selected uniformly at random; that is, the bounds provided in
[7] only apply to the standard (random) selection of nodes, hereinafter denoted as the random selection
of nodes strategy. Given that the goal of our paper is to evaluate alternative selection strategies, we
shall adapt the model. Therefore, let $p_1, p_2, p_3, \ldots, p_c$ be the corresponding selection probabilities
assigned by the circuit construction algorithm to each node controlled by the adversary. Then the
probability of success corresponds to the following expression:

\[ (p_1 + p_2 + p_3 + \cdots + p_c) \cdot (p_1 + p_2 + p_3 + \cdots + p_c) \]

that can be simplified as:

\[ \left( \sum_{i=1}^{c} p_i \right)^2 \]

Following is the analysis.

Theorem 1. Let $c$ be the number of nodes controlled by the adversary. Let the Tor client use a
selection criteria in which, for a certain circuit, every node selection is independent. Let $p_1, p_2, p_3,
\ldots, p_c$ be the corresponding selection probabilities assigned by the circuit construction algorithm
to each node controlled by the adversary. Then, the success of the adversary to compromise the
security of the circuit is bounded by the following probability:

\[ \left( \sum_{i=1}^{c} p_i \right)^2 \]

Proof. The proof is direct by using the sum and product rules of probability theory, and taking into
account that the selection of every node is an independent event. First, the probability of selecting
the entrance or exit node in the set of nodes controlled by the adversary is (sum rule):

\[ \sum_{i=1}^{c} p_i \]

Then, the probability of selecting, at the same time, a controlled entrance and exit node in a circuit
is (product rule):

\[ \sum_{i=1}^{c} p_i \cdot \sum_{i=1}^{c} p_i = \left( \sum_{i=1}^{c} p_i \right)^2 \]

Corollary 1. The Syverson et al. success probability bound in [7], i.e., $\left(\frac{c}{n}\right)^2$, is equivalent to
the bound defined in Theorem 1 when the circuit selection criteria is a random selection of
nodes.

Proof. Let $N$ be the set of nodes deployed in a Tor network with $n = |N|$, and let $A \subseteq N$ be the
subset of nodes controlled by an adversary with $c = |A|$. The probability of a node $n_i \in N$ to be
selected is $p_i = \frac{1}{n}$. Then, by applying it to the bound defined in Theorem 1, we obtain:

\[ \left( \sum_{i=1}^{c} p_i \right)^2 = (c \cdot p_i)^2 = \left( c \cdot \frac{1}{n} \right)^2 = \left( \frac{c}{n} \right)^2 \]
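The bound of Theorem 1 and its uniform special case in Corollary 1 can be checked numerically. The following is a minimal sketch; the function names `linking_success_bound` and `uniform_bound` are illustrative and do not come from the paper. Exact fractions are used so that the equality with $(c/n)^2$ is not blurred by floating-point rounding.

```python
from fractions import Fraction


def linking_success_bound(adversary_probs):
    """Theorem 1: square of the summed selection probabilities of the
    nodes controlled by the adversary (entry and exit chosen independently)."""
    total = sum(adversary_probs)
    return total * total


def uniform_bound(c, n):
    """Corollary 1: under uniform selection every p_i equals 1/n,
    so the bound collapses to (c/n)^2."""
    return linking_success_bound([Fraction(1, n)] * c)


if __name__ == "__main__":
    # 10 adversary nodes out of 100, uniform selection.
    print(uniform_bound(10, 100))
    # Same 10 nodes, but a (hypothetical) biased strategy picks each of
    # them with probability 1/20: the bound grows accordingly.
    print(linking_success_bound([Fraction(1, 20)] * 10))
```

A strategy that concentrates selection probability on few nodes raises the bound whenever the adversary controls those nodes, which is why the remaining sections compare strategies through their selection distributions.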



2.3 Anonymity degree 

Most work in the related literature has used the (Shannon) entropy concept to measure the anonymity
degree of anonymisers like Tor (cf. [11, 12] and citations thereof). We recall that the entropy is a
measure of the uncertainty associated with a random variable, that can efficiently be adapted to
address new networking research problems [13-15]. In this paper, the entropy concept is used
to determine how predictable the selection of the nodes is in accordance with a given strategy or, in
other words, how easy it is to violate the anonymity in relation to the adversary model defined in
Section 2.2. Formally, given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with a sample space $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$
where $\omega_i$ denotes the outcome of the node $n_i \in N$ ($\forall i \in \{1, \ldots, n\}$), a $\sigma$-field $\mathcal{F}$ of subsets of
$\Omega$, and a probability measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$, we consider a discrete random variable $X$ defined as
$X : \Omega \to \mathbb{R}$ that takes values in the countable set $\{x_1, x_2, \ldots, x_n\}$, where every value $x_i \in \mathbb{R}$
corresponds to the node $n_i \in N$. The discrete random variable $X$ has a pmf (probability mass
function) $f : \mathbb{R} \to [0, 1]$ given by $f(x_i) = p_i = \mathbb{P}(X = x_i)$. Then, we define the entropy of a
discrete random variable (i.e., the entropy of a Tor network) as:

\[ H(X) = -\sum_{i=1}^{n} p_i \cdot \log_2(p_i) \qquad (1) \]

Since the entropy is a function whose image depends on the number of nodes, with property
$H(X) \geq 0$, it cannot be used to compare the level of anonymity of different systems. A way
to avoid this problem is as follows. Let $H_M(X)$ be the maximal entropy of a system, then the
entropy that the adversary may obtain after the observation of the system is characterised by
$H_M(X) - H(X)$. The maximal entropy $H_M(X)$ of the network applies when there is a uniform
distribution of probabilities (i.e., $\mathbb{P}(X = x_i) = p_i = \frac{1}{n}$, $\forall i \in \{1, \ldots, n\}$), and this leads to
$H(X) = H_M(X) = \log_2(n)$. The anonymity degree shall then be defined as:

\[ d = 1 - \frac{H_M(X) - H(X)}{H_M(X)} = \frac{H(X)}{H_M(X)} \qquad (2) \]

Note that by dividing $H_M(X) - H(X)$ by $H_M(X)$, the resulting expression is normalised.
Therefore, it follows immediately that $0 \leq d \leq 1$.
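Equations (1) and (2) translate directly into a few lines of code. The sketch below is illustrative only; the names `entropy` and `anonymity_degree` are assumptions, and zero-probability nodes are skipped using the usual convention that $0 \cdot \log_2(0) = 0$.

```python
import math


def entropy(pmf):
    """Equation (1): Shannon entropy, in bits, of a selection pmf."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def anonymity_degree(pmf):
    """Equation (2): entropy normalised by the maximal entropy log2(n)."""
    n = len(pmf)
    if n < 2:
        raise ValueError("the network needs at least two nodes")
    return entropy(pmf) / math.log2(n)


if __name__ == "__main__":
    uniform = [1 / 8] * 8            # every node equally likely
    skewed = [0.65] + [0.05] * 7     # one node strongly preferred
    print(anonymity_degree(uniform))
    print(anonymity_degree(skewed))
```

The uniform distribution reaches the upper end of the $[0, 1]$ range, while any skew in the selection probabilities lowers the degree.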
2.4 Selection criteria

Taking into account the aforementioned anonymity degree expression, we can now formally define 
a selection of Tor nodes criteria as follows. 

Definition 1. A selection of Tor nodes criteria is an algorithm executed by a Tor client $s$ that,
from a set of nodes $N$ with $n = |N|$ and a length of a circuit $\delta$, selects (using a given policy)
the entrance node $e$, the exit node $x$, and the intermediary nodes $r_i$, $\forall i \in \{1, \ldots, \delta - 2\}$, and
outputs its corresponding circuit $C = (s, e, r_1, \ldots, r_{\delta-2}, x)$ with a path $P = \{a_1, \ldots, a_\delta\}$,
where $a_1 = (s, e)$, $a_2 = (e, r_1)$, $a_3 = (r_1, r_2)$, ..., $a_{\delta-1} = (r_{\delta-3}, r_{\delta-2})$, $a_\delta = (r_{\delta-2}, x)$. We
use the notation convention $\psi(N, \delta)$ to denote the algorithm. The policy for the selection criteria
of nodes can be modelled as a discrete random variable $X$ that has a pmf $f(x)$, and we use the
notation $\psi(N, \delta) \sim f(x)$.

3 Anonymity degree of three classical circuit construction strategies 

In this section, we present three existing strategies for the construction of Tor circuits, and elaborate 
on the conceptual evaluation of their anonymity degree. 



3.1 Random selection of nodes 

The random selection of Tor nodes is an algorithm $\psi_{rnd}(N, \delta) \sim f_{rnd}(x)$ with an associated
discrete random variable $X_{rnd}$. The procedure associated to this selection criteria is outlined in
Algorithm 1. The selection policy of $\psi_{rnd}(N, \delta)$ is based on uniformly choosing at random those
nodes that will be part of the resulting circuit. Thus, the pmf $f_{rnd}(x)$ is defined as follows:

\[ f_{rnd}(x_i) = p_i = \mathbb{P}(X_{rnd} = x_i) = \frac{1}{n} \]

Hence, the entropy of a Tor network whose clients use a random selection of nodes is
characterised by the following expression:

\[ H_{rnd}(X_{rnd}) = -\sum_{i=1}^{n} \frac{1}{n} \cdot \log_2\left(\frac{1}{n}\right) = -\frac{1}{n} \sum_{i=1}^{n} \left( \log_2(1) - \log_2(n) \right) = \log_2(n) \]

Theorem 2. The selection of Tor nodes $\psi_{rnd}(N, \delta) \sim f_{rnd}(x)$ with an associated discrete
random variable $X_{rnd}$ gives the maximum degree of anonymity among all the possible selection
algorithms.

Proof. The proof is direct by replacing $H_{rnd}(X_{rnd})$ in Equation (2):

\[ d_{rnd} = \frac{H_{rnd}(X_{rnd})}{H_M(X_{rnd})} = \frac{\log_2(n)}{\log_2(n)} = 1 \]

which is the upper bound of the anonymity degree $d$.
3.2 Geographical selection of nodes

The geographical selection of Tor nodes is an algorithm $\psi_{geo}(N, \delta) \sim f_{geo}(x)$ with an associated
discrete random variable $X_{geo}$. Its selection method is based on uniformly choosing the nodes that
belong to the same country as the client $s$ that executes $\psi_{geo}(N, \delta)$. The aim of this strategy is to
reduce the latency of the communications using the Tor network, since the number of hops between
Tor nodes of the same country is normally smaller than the number of hops between nodes that are
located in different countries. Algorithm 2 summarises the procedure associated with this selection
criteria.

Formally, we define a function $g_c : \mathbb{R} \to \mathbb{N}$ that, given a certain node $x_i \in X_{geo}$, returns a
number that identifies its country. Thus, given the specific country number $K_c$ of the client node $s$,
the pmf $f_{geo}(x)$ is characterised by the following expression:

\[ f_{geo}(x_i) = p_i = \mathbb{P}(X = x_i) = \begin{cases} \frac{1}{m} & \text{if } g_c(x_i) = K_c; \\ 0 & \text{otherwise.} \end{cases} \]

where $m = |\{x_i \in X_{geo} \mid g_c(x_i) = K_c\}|$. Then, the entropy of a system whose client nodes use a
geographical selection for a certain country $K_c$ is:

\[ H_{geo}(X_{geo}) = -\sum_{i=1}^{m} \frac{1}{m} \cdot \log_2\left(\frac{1}{m}\right) = \log_2(m) \]

Therefore, by replacing the previous expression in Equation (2), the anonymity degree is equal to:

\[ d_{geo} = \frac{\log_2(m)}{\log_2(n)} \]



Theorem 3. The maximum anonymity degree of a Tor network whose clients use a geographical
selection of nodes is achieved iff all the nodes are in the same fixed country $K_c$.

Proof. $(\Rightarrow)$ Given $d_{geo} = \frac{\log_2(m)}{\log_2(n)}$ for the country $K_c$ of a particular client $s$, we can impose the
restriction of maximum degree of anonymity:

\[ d_{geo} = \frac{\log_2(m)}{\log_2(n)} = 1 \]

Hence,

\[ \log_2(m) = \log_2(n) \]
\[ 2^{\log_2(m)} = 2^{\log_2(n)} \]
\[ m = n \]

$(\Leftarrow)$ If $g_c(x_i) = K_c$, $\forall x_i \in X_{geo}$, then we have that $m = |\{x_i \in X_{geo} \mid g_c(x_i) = K_c\}| = n$.
Thus,

\[ d_{geo} = \frac{\log_2(m)}{\log_2(n)} = \frac{\log_2(n)}{\log_2(n)} = 1 \]

Algorithm 1 Random Selection of Nodes - $\psi_{rnd}(N, \delta)$
Input: $s$, $N$, $\delta$
Output: $C = (s, e, r_1, r_2, \ldots, r_{\delta-2}, x)$, $P = \{a_1, \ldots, a_\delta\}$

    $M \leftarrow N$
    $C \leftarrow \{s\}$
    for $i \leftarrow 1$ to $\delta$ do
        $j \leftarrow$ random$(1, |M|)$
        $C \leftarrow C \cup \{m_j \mid m_j \in M\}$
        $P \leftarrow P \cup \{(c_i, c_{i+1})\}$
        $M \leftarrow M \setminus \{m_j \mid m_j \in M\}$
    end for

Algorithm 2 Geographical Selection of Nodes - $\psi_{geo}(N, \delta)$
Input: $s$, $N$, $\delta$, $K_c$
Output: $C = (s, e, r_1, r_2, \ldots, r_{\delta-2}, x)$, $P = \{a_1, \ldots, a_\delta\}$

    $M \leftarrow \{m \in N \mid g_c(m) = K_c\}$
    $C \leftarrow \{s\}$
    for $i \leftarrow 1$ to $\delta$ do
        $j \leftarrow$ random$(1, |M|)$
        $C \leftarrow C \cup \{m_j \mid m_j \in M\}$
        $P \leftarrow P \cup \{(c_i, c_{i+1})\}$
        $M \leftarrow M \setminus \{m_j \mid m_j \in M\}$
    end for
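Algorithms 1 and 2 differ only in the candidate pool they draw from, which the following sketch makes explicit. It is an illustration under stated assumptions rather than the Tor client's actual code: nodes are plain strings, `country_of` is a hypothetical mapping standing in for the function $g_c$, and the circuit is kept as an ordered list.

```python
import random


def random_selection(client, nodes, delta, rng=random):
    """Algorithm 1 sketch: draw delta distinct nodes uniformly at random."""
    pool = list(nodes)
    if delta > len(pool):
        raise ValueError("not enough candidate nodes for the requested length")
    circuit = [client]
    for _ in range(delta):
        j = rng.randrange(len(pool))   # uniform index into the remaining pool
        circuit.append(pool.pop(j))    # a chosen node cannot be drawn again
    path = list(zip(circuit, circuit[1:]))  # links (c_i, c_{i+1})
    return circuit, path


def geographical_selection(client, nodes, delta, country_of, client_country,
                           rng=random):
    """Algorithm 2 sketch: same draw, restricted to the client's country."""
    pool = [m for m in nodes if country_of[m] == client_country]
    return random_selection(client, pool, delta, rng)


if __name__ == "__main__":
    nodes = ["n%d" % i for i in range(10)]
    country_of = {m: ("ES" if i < 4 else "FR") for i, m in enumerate(nodes)}
    print(random_selection("s", nodes, 3))
    print(geographical_selection("s", nodes, 3, country_of, "ES"))
```

Restricting the pool from $n$ to $m$ nodes is exactly what reduces the entropy from $\log_2(n)$ to $\log_2(m)$.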

Theorem 4. Given a Tor network whose clients use the algorithm $\psi_{geo}(N, \delta) \sim f_{geo}(x)$ for a
fixed country $K_c$, and with an associated discrete random variable $X_{geo}$, the anonymity degree
increases as $m$ approaches $n$ (i.e., $m \to n$), where $m = |\{x_i \in X_{geo} \mid g_c(x_i) = K_c\}|$ and
$n = |N|$.

Proof. It suffices to prove that $d_{geo}$ is a monotonically increasing function of $m$. That is, we must prove
that $\frac{d}{dm}(d_{geo}) > 0$, $\forall m > 0$. The proof is direct by differentiation, since the inequality:

\[ \frac{d}{dm} \left( \frac{\log_2(m)}{\log_2(n)} \right) = \frac{1}{m \cdot \ln(n)} > 0 \]

is true $\forall m > 0$ and $\forall n > 1$. Note that, from the point of view of a Tor network, the
restriction $n > 1$ on the number of nodes makes sense, since a network with $n \leq 1$ nodes is
useless as a way to provide an anonymous infrastructure.

Figure 1 depicts the influence of the uniformity of the number of nodes per country on the
anonymity degree. It shows, for a fixed country, the anonymity degree of four Tor networks as a
function of the nodes that are located in that country with respect to the total number of nodes of
the network. The considered Tor networks have, respectively, 10, 50, 100 and 200 nodes. Their
anonymity degrees are denoted as $d_{10}$, $d_{50}$, $d_{100}$ and $d_{200}$. We can observe that the anonymity
degree increases as the total number of nodes of the same country grows (cf. Theorem 4). This
holds until the maximum value of anonymity is achieved, which occurs when the
number of nodes of the particular country is the same as the nodes that compose the entire network
(cf. Theorem 3).
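The curves described for Figure 1 follow from $d_{geo} = \log_2(m) / \log_2(n)$ alone, so they can be tabulated without any network data. The sketch below is illustrative; the helper names and the choice of five sample points per network are assumptions made for the example.

```python
import math


def geo_anonymity_degree(m, n):
    """d_geo = log2(m) / log2(n) for m same-country nodes out of n."""
    if n < 2 or not 1 <= m <= n:
        raise ValueError("need n >= 2 and 1 <= m <= n")
    return math.log2(m) / math.log2(n)


def degree_curve(n, points=5):
    """Return (m, d_geo) pairs for m spread between 1 and n."""
    ms = sorted({max(1, round(n * k / points)) for k in range(points + 1)})
    return [(m, geo_anonymity_degree(m, n)) for m in ms]


if __name__ == "__main__":
    for n in (10, 50, 100, 200):
        row = ", ".join("m=%d: %.3f" % (m, d) for m, d in degree_curve(n))
        print("n=%d -> %s" % (n, row))
```

Each row is non-decreasing in $m$ (Theorem 4) and ends at 1 when $m = n$ (Theorem 3).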
Fig. 1: Influence of the uniformity of the number of nodes per country in the anonymity degree for



Theorem 5. Given a client $s$ that uses as selection algorithm $\psi_{geo}(N, \delta)$ in a Tor network with
$n = |N|$, such that the network nodes belong to $p \leq n$ different countries, where $p$ is the
number of different countries in the Tor network, then the best distribution of nodes that maximises the
anonymity degree of the whole system is achieved iff every country has $t = \lceil \frac{n}{p} \rceil$ nodes.

Proof. Let $p$ be the number of different countries of a Tor network. We can consider a collection
of subsets $S_1, S_2, \ldots, S_p \subseteq N$ such that $\bigcup_{i=1}^{p} S_i = N$ and $\bigcap_{i=1}^{p} S_i = \emptyset$. Let $t_i$ be the number of
nodes associated to the subset $S_i$, $i \in \{1, \ldots, p\}$. Then, the anonymity degree of the whole system
is maximised when the sum of all the degrees of anonymity of every country equals 1:

\[ \sum_{i=1}^{p} \frac{\log_2(t_i)}{\log_2(n)} = 1 \]
\[ \frac{\log_2(t_1)}{\log_2(n)} + \frac{\log_2(t_2)}{\log_2(n)} + \cdots + \frac{\log_2(t_p)}{\log_2(n)} = 1 \]
\[ 2^{\log_2(t_1)} + 2^{\log_2(t_2)} + \cdots + 2^{\log_2(t_p)} = 2^{\log_2(n)} \]
\[ t_1 + t_2 + \cdots + t_p = n \]

However, maximising the anonymity degree of the whole system also implies having the same
uncertainty inside every subset $S_i$, $i \in \{1, \ldots, p\}$, or, in other words, having the same number of
nodes in every subset. Hence, we have $t_1 = t_2 = \cdots = t_p = t$ and this leads to:

\[ t_1 + t_2 + \cdots + t_p = n \]
\[ \underbrace{t + t + \cdots + t}_{p \text{ times}} = n \]
\[ p \cdot t = n \]
\[ t = \frac{n}{p} \]
$(\Leftarrow)$ Let $t = \lceil \frac{n}{p} \rceil$ be the number of nodes of a certain subset $S_i$, $i \in \{1, \ldots, p\}$. We have
$\sum_{i=1}^{p} |S_i| = p \cdot t = n$. The pmf associated to $\psi_{geo}(N, \delta)$ is then $f_{geo}(x) = \frac{1}{t}$ for each subset $S_i$,
$i \in \{1, \ldots, p\}$. Therefore, the entropy of each subset (i.e., country) is:

\[ H_{geo}(X_{geo}) = -\sum_{i=1}^{t} \frac{1}{t} \cdot \log_2\left(\frac{1}{t}\right) = \log_2(t) \]

Hence, for each subset $S_i$, $i \in \{1, \ldots, p\}$, the anonymity degree can be expressed as follows:

\[ d_{geo} = \frac{\log_2(t)}{\log_2(n)} \]

Suppose now, by contradiction, that there exists a unique $S_q \in \{S_1, S_2, \ldots, S_p\}$ for a particular
country $K_q$ such that $|S_q| \neq t$, and its anonymity degree is expressed by $d_{geo^*} = \frac{\log_2(|S_q|)}{\log_2(n)}$.
Then, taking into account that $d_{geo}$ and $d_{geo^*}$ are monotonically increasing functions (cf. proof of
Theorem 4), we have two options:

- If $|S_q| < t \rightarrow d_{geo^*} < d_{geo}$

- If $|S_q| > t \rightarrow d_{geo^*} > d_{geo}$

But this is not possible since:

\[ \sum_{i=1}^{p} |S_i| = n \]
\[ (p - 1) t + |S_q| = n \]
\[ |S_q| = n - t(p - 1) = n - \frac{n}{p}(p - 1) = \frac{n}{p} \]

which implies that $d_{geo^*} = d_{geo}$, contradicting the above two options.



3.3 Bandwidth selection of nodes 

The bandwidth selection of nodes strategy is an algorithm $\psi_{bw}(N, \delta) \sim f_{bw}(x)$ with an associated
discrete random variable $X_{bw}$ whose selection policy is based on choosing, with high probability,
the nodes with the best network bandwidth. The procedure associated to this selection criteria is
outlined in Algorithm 3. The aim of this strategy is to reduce the latency of the communications
through a Tor circuit, especially when the communications involve a high rate of data exchange. At
the same time, this mechanism provides a balanced anonymity degree, since the selection of nodes
is not fully deterministic from the adversary's point of view.

In this strategy, the entropy and the anonymity degree can be described formally as follows.
First, we define a bandwidth function $g_{bw} : \mathbb{R} \to \mathbb{N}$ that, given a certain node $x_i \in X_{bw}$, returns
its associated bandwidth. Then, the pmf $f_{bw}(x)$ is defined by the expression:

\[ f_{bw}(x_i) = p_i = \mathbb{P}(X_{bw} = x_i) = \frac{g_{bw}(x_i)}{T_{bw}} \]

where $T_{bw} = \sum_{i=1}^{n} g_{bw}(x_i)$ is the total bandwidth of the Tor network. Hence, the entropy of a system
whose clients use a bandwidth selection of nodes strategy is:

\[ H_{bw}(X_{bw}) = -\sum_{i=1}^{n} \frac{g_{bw}(x_i)}{T_{bw}} \cdot \log_2\left(\frac{g_{bw}(x_i)}{T_{bw}}\right) \]

By replacing $H_{bw}(X_{bw})$ in Equation (2), the anonymity degree is, then, as follows:

\[ d_{bw} = -\frac{1}{T_{bw} \cdot \log_2(n)} \sum_{i=1}^{n} g_{bw}(x_i) \cdot \log_2\left(\frac{g_{bw}(x_i)}{T_{bw}}\right) \]

Algorithm 3 Bandwidth Selection of Nodes - $\psi_{bw}(N, \delta)$
Input: $s$, $N$, $\delta$
Output: $C = (s, e, r_1, r_2, \ldots, r_{\delta-2}, x)$, $P = \{a_1, \ldots, a_\delta\}$

    /* Compute a weighted well-ordered set */
    $M \leftarrow \{m_i \in N \mid g_{bw}(m_i) \leq g_{bw}(m_{i+1})\}$
    $T_{bw} \leftarrow \sum_{i=1}^{n} g_{bw}(m_i)$, $\forall m_i \in M$
    for $i \leftarrow 1$ to $n$ do
        $W \leftarrow W \cup \{(m_i, \sum_{j=1}^{i} \frac{g_{bw}(m_j)}{T_{bw}}) \mid \forall m_i, m_j \in M\}$
    end for

    /* Compute the nodes of the circuit C */
    $C \leftarrow \{s\}$
    for $i \leftarrow 1$ to $\delta$ do
        $rnd \leftarrow$ random$(0, 1)$
        Select a tuple $(m_j, bw_j) \in W$, where
            $m_j \notin C$ and
            $rnd \in [bw_j, bw_{j+1})$
        $C \leftarrow C \cup \{m_j\}$
        $P \leftarrow P \cup \{(c_i, c_{i+1})\}$
    end for

Theorem 6. Given a selection of Tor nodes $\psi_{bw}(N, \delta) \sim f_{bw}(x)$ with an associated discrete
random variable $X_{bw}$, the maximum anonymity degree is achieved iff $g_{bw}(x_i) = K_{bw}$ $\forall x_i \in X_{bw}$,
where $K_{bw}$ is a constant.

Proof. $(\Rightarrow)$ $H(X_{bw}) = H_M(X_{bw})$ would imply that the anonymity degree is maximum. This
is only possible when $f_{bw}(x_i) = \frac{g_{bw}(x_i)}{T_{bw}} = \frac{1}{n}$, $\forall x_i \in X_{bw}$. Therefore,

\[ \frac{g_{bw}(x_i)}{T_{bw}} = \frac{1}{n} \]
\[ g_{bw}(x_i) = \frac{T_{bw}}{n} \]

and since $T_{bw}$ and $n$ are constant values for a certain Tor network, we can consider that $g_{bw}(x_i)$ is
also a constant, $\forall x_i \in X_{bw}$.

$(\Leftarrow)$ Given $f_{bw}(x_i) = \frac{g_{bw}(x_i)}{T_{bw}}$, it is easy to see that if $g_{bw}(x_i) = K_{bw}$ $\forall x_i \in X_{bw}$ then $T_{bw} =
\sum_{i=1}^{n} g_{bw}(x_i) = n \cdot K_{bw}$ and, as a consequence, $f_{bw}(x_i) = \frac{K_{bw}}{n \cdot K_{bw}} = \frac{1}{n}$ $\forall x_i \in X_{bw}$. Hence, by
replacing $f_{bw}(x_i) = \frac{1}{n}$ in Equation (2), we get $d_{bw} = 1$.

Fig. 2: Influence of the uniformity of the bandwidth distribution in the anonymity degree for $\psi_{bw}(N, \delta)$ (x-axis: nodes with the same bandwidth of a total of 100)

Figure 2 shows the relation between the uniformity of the bandwidth of the nodes and the
anonymity degree of the whole system. It depicts the anonymity degree of a Tor system with 100
nodes, measured under different restrictions. In particular, the bandwidth of the nodes has been
modified so that a certain subset of nodes has the same bandwidth, and the bandwidth of
the remaining nodes has been fixed at random. During all the measurements the total bandwidth of
the system $T_{bw}$ remains constant. As the size of the subset is increased, and more nodes have the
same bandwidth, the uncertainty is higher from the point of view of the discrete random variable
associated to $\psi_{bw}(N, \delta)$. Therefore, the anonymity degree increases as the
distribution of the bandwidths becomes more uniform.
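The bandwidth strategy and its anonymity degree can be sketched as follows. This is an illustration, not the paper's or Tor's implementation: instead of the cumulative-interval lookup with rejection of already chosen nodes, the sketch samples without replacement using the standard library's `random.choices`, which is assumed here to give the same proportional-to-bandwidth preference among the remaining nodes. All function names and the sample bandwidth values are hypothetical.

```python
import math
import random


def bandwidth_pmf(bandwidths):
    """f_bw(x_i) = g_bw(x_i) / T_bw for a list of node bandwidths."""
    total = sum(bandwidths)
    return [b / total for b in bandwidths]


def anonymity_degree(pmf):
    """Equation (2): Shannon entropy of the pmf divided by log2(n)."""
    h = -sum(p * math.log2(p) for p in pmf if p > 0)
    return h / math.log2(len(pmf))


def bandwidth_selection(client, bandwidth_of, delta, rng=random):
    """Pick delta distinct nodes, each draw weighted by node bandwidth."""
    remaining = dict(bandwidth_of)
    if delta > len(remaining):
        raise ValueError("not enough nodes for the requested length")
    circuit = [client]
    for _ in range(delta):
        names = list(remaining)
        weights = [remaining[m] for m in names]
        chosen = rng.choices(names, weights=weights, k=1)[0]
        circuit.append(chosen)
        del remaining[chosen]          # never reuse a node in one circuit
    return circuit


if __name__ == "__main__":
    print(anonymity_degree(bandwidth_pmf([5, 5, 5, 5])))    # uniform case
    print(anonymity_degree(bandwidth_pmf([97, 1, 1, 1])))   # one dominant node
    print(bandwidth_selection("s", {"a": 10, "b": 1, "c": 1, "d": 5}, 3))
```

The uniform case reproduces the condition of Theorem 6, while a single dominant node makes the selection far more predictable.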



4 New strategy based on latency graphs 

We present in this section a new selection criteria. The new strategy relies on modelling the Tor
network as an undirected graph $G(V, E)$, where $V = N \cup \{s\}$ denotes the set composed of the Tor
nodes $N = \{v_1, \ldots, v_n\}$ and the client node $v_{n+1} = s$, and where $E = \{e_{12}, e_{13}, \ldots, e_{ij}\}$ denotes
the set of the edges of the graph. We use the notation $e_{ij} = (v_i, v_j)$ to refer to the edge between
two nodes $v_i$ and $v_j$. The set of edges $E$ represents the potential connectivity between the nodes
in $V$, according to some partial knowledge of the network status which the strategy has. If an edge
$e_{ij} = (v_i, v_j)$ is in $E$, then the connectivity between nodes $v_i$ and $v_j$ is potentially possible. The
set of edges $E$ is a dynamic set, i.e., the network connectivity (from a TCP/IP standpoint) changes
periodically in time, while the set of vertices $V$ is a static set. Finally, and although the network
connectivity from node $v_i$ to $v_j$ is not necessarily the same as the connectivity from $v_j$ to $v_i$, we
decided to model the graph as undirected for simplicity. Our decision also rests on the
following two facts: (i) in a TCP/IP network, the presence of nodes is more persistent than the
connectivity among them; and (ii) the connectivity is usually the same from a bidirectional routing
point of view in TCP/IP networks.

Related to the edges of the graph $G(V, E)$, we define a function $c_t : E \to \mathbb{R} \cup \{\infty\}$ such that,
for every edge $e_{ij} \in E$, the function returns the associated network latency between nodes $v_i$ and
$v_j$ at time $t$. If there is no connectivity between nodes $v_i$ and $v_j$ at time $t$, then we say that the
connectivity is undefined, and function $c_t$ returns the infinity value. Notice that function $c_t$ can be
implemented in several ways. Some previous work in the field includes software tools to monitor
the network based on IP geolocation [16], modelling of networks as stochastic systems [17], and
network tomography [18]. Regardless of the strategy used to implement $c_t$, there is an important
restriction from a security point of view: leakage of sensitive information in the measurement
process shall be contained. This mandatory constraint must always be fulfilled. Otherwise, an
adversary can benefit from a monitoring process in order to degrade the anonymity degree.



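Algorithm 4 Latency Computation Process - lat_comp($G(V, E)$, $\Delta t$, $m$)
Input: $G(V, E)$, $\Delta t$, $m$

    $t_0 \leftarrow t_q \leftarrow 0$
    $E \leftarrow \emptyset$
    $L(e_{ij}) \leftarrow (\infty, t_0)$
    while TRUE do
        $t_q \leftarrow t_q + 1$
        for $i \leftarrow 1$ to $m$ do
            $i, j \leftarrow$ random$(1, |V|)$, $i \neq j$
            $l_q \leftarrow c_t(e_{ij})$
            if $l_q = \infty$ then
                $E \leftarrow E \setminus \{e_{ij}\}$
            else
                $E \leftarrow E \cup \{e_{ij}\}$
                Given $L(e_{ij}) = (l_p, t_p)$
                if $l_p \neq \infty$ then
                    $\alpha \leftarrow (t_p - t_0) / (t_q - t_0)$
                    $l_q \leftarrow \alpha \cdot l_p + (1 - \alpha) \cdot l_q$
                end if
                $L(e_{ij}) \leftarrow (l_q, t_q)$
            end if
        end for
        sleep($\Delta t$)
    end while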



Given the aforementioned rationale, we propose now the construction of our new selection 
strategy by means of two general processes. A first process computes and maintains the set of edges 
of the graph and its latencies. The second process establishes, according to the outcomes provided 
by the first process, circuit nodes. Circuit nodes are chosen from those identified within graph paths
with minimum latency. These two processes are summarised, respectively, in Algorithms 4 and 6.
A more detailed explanation of the proposed strategy is given below.

Algorithm 5 K-paths Computation Process - kpaths($G(V, E)$, $\delta$, $k$, x_node, cur_path, paths_list)
Input: $G(V, E)$, $\delta$, $k$, x_node, cur_path, paths_list

    if len(paths_list) = $k$ then
        return
    end if
    if len(cur_path) > $\delta$ then
        return
    end if
    $v_i \leftarrow$ last_vertex(cur_path)
    new_len $\leftarrow$ len(cur_path) + 1
    adjacency_list $\leftarrow$ adjacent_vertices($G(V, E)$, $v_i$)
    remove_nodes(adjacency_list, cur_path)
    random_shuffle(adjacency_list)
    for vertex in adjacency_list do
        if vertex = x_node and new_len < $\delta$ then
            continue
        end if
        if vertex = x_node and new_len = $\delta$ then
            new_sol $\leftarrow$ cur_path + (vertex)
            paths_list $\leftarrow$ paths_list + (new_sol)
            break
        end if
        cur_path $\leftarrow$ cur_path + (vertex)
        kpaths($G(V, E)$, $\delta$, $k$, x_node, cur_path, paths_list)
    end for
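The path search of Algorithm 5 and the minimum-latency choice used by Algorithm 6 can be sketched as a randomised depth-first search. The code below is an illustration under several assumptions that are not in the paper: the graph is a dictionary mapping each vertex to its set of neighbours, latencies are stored in a dictionary keyed by `frozenset` edges, and each recursive call receives a fresh copy of the current path so that sibling branches do not share state. The length test follows the pseudocode literally, so a solution holds $\delta$ vertices counting the client.

```python
import random


def kpaths(graph, delta, k, x_node, cur_path, paths_list, rng=random):
    """Collect up to k simple paths of delta vertices ending at x_node."""
    if len(paths_list) >= k or len(cur_path) >= delta:
        return
    new_len = len(cur_path) + 1
    # Neighbours of the last vertex that are not already on the path.
    candidates = sorted(v for v in graph[cur_path[-1]] if v not in cur_path)
    rng.shuffle(candidates)
    for vertex in candidates:
        if len(paths_list) >= k:
            return
        if vertex == x_node:
            if new_len == delta:       # exit node reached at the right depth
                paths_list.append(cur_path + [vertex])
            continue                   # never pass through the exit node
        kpaths(graph, delta, k, x_node, cur_path + [vertex], paths_list, rng)


def min_weighted_path(paths_list, latency):
    """Return the candidate path with the smallest summed edge latency."""
    def weight(path):
        return sum(latency[frozenset(e)] for e in zip(path, path[1:]))
    return min(paths_list, key=weight)


if __name__ == "__main__":
    vertices = ["s", "a", "b", "c", "d"]
    graph = {v: {u for u in vertices if u != v} for v in vertices}
    found = []
    kpaths(graph, delta=4, k=3, x_node="d", cur_path=["s"], paths_list=found)
    print(found)
```

Shuffling the adjacency list before descending is what keeps the set of candidate paths unpredictable across runs, even when the latency labels are unchanged.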

The first process (cf. Algorithm 4) is executed in background and keeps a set of labels related to
each edge. Every label is defined by the expression $L(e_{ij}) = (l, t)$, where $e_{ij}$ denotes its associated
edge. The label contains a tuple $(l, t)$ composed of an estimated latency $l$ between the nodes of
the edge (i.e., $v_i$ and $v_j$), and a time instant $t$ which specifies when the latency $l$ was computed.
When the process is executed for the first time, the set of edges and all the labels are initialised as
$E \leftarrow \emptyset$ and $L(e_{ij}) \leftarrow (\infty, 0)$.

At every fixed interval of time $\Delta t$, the process associated to Algorithm 4 proceeds indefinitely
as follows. A set of $m$ edges associated to the complete graph $K_n$ with the same vertices
of $G(V, E)$ are chosen at random. The latency associated to every edge is estimated by means of
the aforementioned function $c_t$. If the computed latency is undefined (i.e., function $c_t$ returns the
infinity value), then the edge is removed from the set $E$ (if it was already in $E$) and the associated
latency labels are not updated. Otherwise, the edge is added to the set $E$ (if it was not already in $E$),
and the values of its corresponding label are updated. In particular, the latency member of the tuple
is modified by using an exponentially weighted moving average (EWMA) strategy [19], and the
time member is updated according to the current time instant $t_q$. For instance, let us suppose that
we are in the time instant $t_q$ and we have chosen randomly the edge $e_{ij}$ with an associated label
$L(e_{ij}) = (l_p, t_p)$. Let us also suppose that $l_q = c_{t_q}(e_{ij})$ is the new latency estimated for such an
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Algorithm 6 Graph of Latencies Selection of Nodes - ip gr p(N, S) 

Input: G(V, E), s, S, k, maxjiter, At 

Output: C = (s,e, r 1 ,r 2 , —, rs-2,x), P = {al, a s } 

P <- 

paths_list {) 
iter 

/* Executed in background as a process */ 
lat_comp(G(V, E), At, m) 

repeat 

curjpath 4— (s) 

x_node -s— randorrLvertexCl/ \ {s}) 
kpaths(G(V, E), 8, k, x_node, cur jpath, paths _list) 
iter <— iter + 1 
until (not empty(paths_list)) or (iter = maxjiter) 

if not empty(paths_list) then 

G <— min_weighted_path(patfes_Zis£) 

else 

C <— random_path(V^, 5) 

end if 

for i <— 1 to 8 — 1 do 



edge. Thus, its corresponding label is updated according to the following expression: 



The first case of the previous expression corresponds to a situation of disconnection between the 
nodes of the edge e^, and that has been detected by the function c t . As a consequence, c t (ey) 
returns infinity. In this case, the previous estimated latency l p is maintained in the tuple, and the 
edge eij is removed from E. The second case can be associated to the first time the latency of the 
edge dj is estimated using c t , since the previous latency was undefined and the infinity value is 
the one used in the first instantiation of L(eij). Under the two last cases of the previous expression, 
the edge aj is always added to the set E if it still does not belong to the aforementioned set. The 
third scenario corresponds to the EWMA in the strict sense. In this case, the coefficient a 6 (0, 1) 
represents a smoothing factor. The value a has an important effects in the resulting estimated 
latency stored in L(e.y). Notice that those values of a that are close to zero give a greater weight 
to the recent measurements of the latency through the function c tq . Contrary to this, a value of a 
closer to one gives a greater weight to the historical measurements, making the resulting latency 
less responsive to recent changes. 

For the definition of the a factor we must consider that the previous update of the latency — for 
a certain edge — could have been performed long time ago. This is possible since, for every interval 
of time At we choose randomly just only m edges to update their latencies. Indeed, the value of 
l p in the previous example could have been computed at the time instant t p , and where t p <C t q . 



P <- PU{(ci,Ci+i)} 



end for 
Therefore, if we define $\alpha$ as a static value, the weight for previous measurements will always
be the same, independently of when the measurement was taken. This is not an acceptable approach,
since the older the previous measurement is, the less weight it should have in the resulting
computed latency.

To overcome this problem, the coefficient $\alpha$ must be defined as a dynamic value that takes
into account the precise moment at which the previous latency was estimated for every edge. In
other words, $\alpha$ should decrease as the time interval between the previous measurement and the
current one grows. In order to define $\alpha$ as a function of this time interval, we must keep the
time instant of the previous latency estimation for a given edge. This can be accomplished by
storing the time instants in the tuple of every edge label. Hence, every time we select $m$ edges at
random to update their latencies, the time members of their labels must be updated with the
current time instant $t_q$. It is important to remark that this update must be done only when the
function $c_t$ returns a value different from infinity. Moreover, for a selected edge $e_{ij}$ at the
time instant $t_q$, its $\alpha$ value is defined as:

$$\alpha = \frac{t_p - t_0}{t_q - t_0}$$

where $t_0$ is the first time instant, when the execution of the process started. A graphical
interpretation of the previous expression is depicted in Figure 3. We can appreciate that
$\alpha \in (0, 1)$ by associating the numerator and the denominator of the expression with their
interval representation in the figure. Thus, we can directly deduce that
$0 < (t_p - t_0) < (t_q - t_0)$ and, consequently, $\alpha \in (0, 1)$. In this figure, we can also see
the influence of the previous time instant $t_p$ on the resulting $\alpha$. In particular, three cases
are presented: a) $t_p \ll t_q$, b) $t_p \approx t_0 + \frac{t_q - t_0}{2}$ and c) $t_p \approx t_q$. For
these cases, we can observe how $\alpha$ tends to, respectively, 0, 0.5 and 1.

Fig. 3: Graphical interpretation of the $\alpha$ coefficient
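
The label update with the time-dependent coefficient can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `update_label`, the tuple layout `(latency, time instant)` and the use of `math.inf` for an undefined or unreachable latency are choices made for this sketch and are not taken from torspd.py. The weighting follows the description in the text, where values of $\alpha$ close to zero favour the new measurement, and it assumes $t_q > t_0$.

```python
import math

def update_label(label, l_q, t_q, t_0):
    """Return (new_label, keep_edge) for one edge of the latency graph.

    label: tuple (l_p, t_p) with the previous latency and its time instant
    l_q:   latency measured now (math.inf if the nodes are disconnected)
    t_q:   current time instant (assumed to be greater than t_0)
    t_0:   time instant at which the process started
    """
    l_p, t_p = label
    if math.isinf(l_q):
        # Disconnection: keep the old label and drop the edge from E.
        return (l_p, t_p), False
    if math.isinf(l_p):
        # First successful estimation for this edge.
        return (l_q, t_q), True
    # EWMA with a dynamic coefficient: the older the previous
    # measurement, the smaller alpha and the less weight it receives.
    alpha = (t_p - t_0) / (t_q - t_0)
    return (alpha * l_p + (1 - alpha) * l_q, t_q), True
```

For instance, with $t_0 = 0$, a previous label `(100.0, 50.0)` and a new measurement of `200.0` at $t_q = 100$, the coefficient is $\alpha = 0.5$ and the stored latency becomes `150.0`.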

The second process (cf. Algorithm 6) is used for the selection of circuit nodes. It utilises the
information maintained by the process associated to Algorithm 4. In particular, the graph $G(V, E)$
and the labels $L(e_{ij})\ \forall e_{ij} \in E$ are shared between both processes. When a user wants
to construct a new circuit, this process is executed and it returns the nodes of the circuit. For
this purpose, an exit node $x$ is chosen at random from the set of vertices $V \setminus \{s\}$. After
that, the process computes up to $k$ random paths of length $\delta$ between the nodes $s$ and $x$.
With this aim, a recursive process, summarised in Algorithm 5, is called. If there is no path
between the vertices $s$ and $x$, another exit node is chosen and the procedure is executed again.
This iteration is repeated until a) some paths of length $\delta$ between the pair of nodes $s$ and
$x$ are found, or b) a certain number of iterations has been performed. In the first case, the path
with the minimum latency among all the obtained paths is selected as the solution. In the second
case, a completely random path of length $\delta$ is returned. To avoid this situation, i.e., to
prevent our new strategy from behaving as a random selection of nodes strategy, the process
associated to Algorithm 4 must be started some time before the effective establishment of circuits
takes place. This way, the graph $G(V, E)$ reaches the necessary level of connectivity among its
vertices. We refer to Section 6 for more practical details and discussions on this point.
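
The selection procedure described above can be sketched as follows. The code is a minimal illustration, not the authors' implementation: the graph is assumed to be a dictionary of dictionaries holding the latency of each edge, the helper names (`random_paths`, `path_latency`, `select_circuit`) are hypothetical, and a randomised depth-first search stands in for the recursive process of Algorithm 5, whose details are not reproduced here.

```python
import random

def random_paths(graph, src, dst, length, k, rng):
    """Collect up to k simple paths with `length` edges from src to dst."""
    found = []

    def extend(path):
        if len(found) >= k:
            return
        if len(path) - 1 == length:
            if path[-1] == dst:
                found.append(list(path))
            return
        neighbours = [v for v in graph.get(path[-1], {}) if v not in path]
        rng.shuffle(neighbours)
        for v in neighbours:
            # The exit node may only appear as the last hop.
            if v == dst and len(path) != length:
                continue
            extend(path + [v])

    extend([src])
    return found

def path_latency(graph, path):
    """Sum of the latency labels along a path."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))

def select_circuit(graph, s, length, k, max_iter, rng=random):
    """Pick an exit at random, then the lowest-latency path found to it."""
    candidates = [v for v in graph if v != s]
    for _ in range(max_iter):
        x = rng.choice(candidates)
        paths = random_paths(graph, s, x, length, k, rng)
        if paths:
            return min(paths, key=lambda p: path_latency(graph, p))
    # No path found within max_iter attempts: fall back to a random circuit.
    return [s] + rng.sample(candidates, length)
```

The fallback branch corresponds to the second case in the text, which is why the latency computation process should have populated the graph before circuits are requested.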

4.1 Discussion on the adversary model 

One may think that an adversary, as initially defined in Section 2, can try to reconstruct the
client graph and guess the corresponding latency labels of our new strategy in order to degrade
its anonymity degree. However, even if we assume the most extreme case, in which the adversary
obtains a complementary complete graph $K_n$ with the set of vertices $N$ and corresponding latency
labels, this does not affect the anonymity degree of our new strategy. First of all, we recall that
the graph of the client is a dynamic random subgraph of $K_{n+1}$ that evolves over time, with a set
of vertices $N \cup \{s\}$. The adversary graph would also be a subgraph of $K_n$ with the set of
vertices $N$, changing dynamically as time goes by. Therefore, the sets of vertices and edges of the
adversary and client graphs will never converge into the same connectivity model of the network.
Moreover, the latencies between the client node $s$ and any other potential entry node $e$ cannot be
calculated by the adversary. Otherwise, this would mean that the anonymity has already been
violated by the adversary. Indeed, the estimated latencies will definitely differ between the
client and the adversary graphs, since they are computed at different time frames and from
different source networks. Finally, the adversary also does not know the exit nodes selected by the
client, nor the $k$ parameter used by the client to choose the paths.

5 Analytical evaluation of the new strategy 

We provide in this section the analytical expression of the anonymity degree of the new strategy. 
First, we extend the list of definitions provided in Section 2. 

5.1 Analytical graph of $\psi_{grp}(N, \delta)$

In order to provide an analytical expression of the anonymity degree, it is important to notice
that this must always be done from the adversary standpoint. In this regard, the graph to be
considered for this purpose differs from the one used to compute a circuit. Note that the latencies
associated to every edge which contains the client node $s$ cannot be estimated by the adversary,
especially if we consider that this particular node is unknown to the adversary. Hence, an
adversary aiming at violating the anonymity of client node $s$ could try to estimate the user graph
without node $s$ and its associated edges. This leads us to the following definition (cf. Figure 4
as a clarifying example):

Fig. 4: Example of a latency graph $G(V, E)$ and its analytical graph $G'(V', E')$ with a selected
circuit $C = (s, v_2, v_3, v_5)$ of length $\delta = 3$



Definition 2. Given a latency graph $G(V, E)$ associated to a selection of Tor nodes strategy
$\psi_{grp}(N, \delta)$ and the client node $s$, we define the analytical graph as $G'(V', E')$ where
$V' = V \setminus \{s\}$ and $E' = E \setminus \{(s, v_i)\}\ \forall v_i \in V$.



5.2 $\lambda$-betweenness and $\lambda$-betweenness probability

For the purpose of computing the degree of anonymity of our new strategy, a new metric inspired
by Freeman's betweenness centrality measure [20] is presented. This metric, called
$\lambda$-betweenness, measures the frequency with which a node $v$ is traversed by all the possible
paths of length $\lambda$ in a graph. The formal definition is given below.

Definition 3. Consider an undirected graph $G(V, E)$. Let $KP_{st}$ denote the set of paths of length
$\lambda$ between a fixed source vertex $s \in V$ and a fixed target vertex $t \in V$. Let $KP_{st}(v)$
be the subset of $KP_{st}$ consisting of paths that pass through the vertex $v$. Then, we define the
$\lambda$-betweenness of the node $v \in V$ as follows:

$$KP_B(v, \lambda) = \sum_{s,t \in V} \frac{\sigma_{st}(v, \lambda)}{\sigma_{st}(\lambda)}$$

where $\sigma_{st}(\lambda) = |KP_{st}|$ and $\sigma_{st}(v, \lambda) = |KP_{st}(v)|$.

As we can observe, the $\lambda$-betweenness provides the proportion between the number of paths of
length $\lambda$ which traverse a certain node $v$ and the total number of paths of length $\lambda$.
However, since the degree of anonymity needs a probability distribution, the following definition
is required.

Definition 4. Consider an undirected graph $G(V, E)$. Let $KP_B(v, \lambda)$ be the
$\lambda$-betweenness of the node $v \in V$. Then, the $\lambda$-betweenness probability of the node $v$
is defined as:

$$LB(v, \lambda) = \frac{KP_B(v, \lambda)}{\sum_{w \in V} KP_B(w, \lambda)}$$

It follows immediately that $0 \leq LB(v, \lambda) \leq 1,\ \forall v \in V$, since this expression is
equivalent to the normalised $\lambda$-betweenness.
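
Definitions 3 and 4 can be illustrated with a small brute-force computation. The sketch below rests on several assumptions that are not stated in the text: the graph is given as an adjacency dictionary, a "path of length $\lambda$" is read as a sequence of $\lambda + 1$ adjacent vertices between two distinct endpoints (the reading under which the closed form of Theorem 10 holds), a vertex counts as traversed when it appears anywhere in the sequence, and pairs with no path contribute nothing. The function names are hypothetical, and the graph is assumed to contain at least one path of the requested length.

```python
from itertools import permutations

def walks(adj, s, t, lam):
    """All vertex sequences with `lam` edges that go from s to t."""
    if lam == 0:
        return [[s]] if s == t else []
    return [[s] + rest for u in adj[s] for rest in walks(adj, u, t, lam - 1)]

def betweenness(adj, lam):
    """KP_B(v, lam) for every vertex v, following Definition 3."""
    kp_b = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):  # ordered pairs with s != t
        paths = walks(adj, s, t, lam)
        if not paths:
            continue  # no path for this pair, so it adds nothing
        for v in adj:
            through = sum(1 for p in paths if v in p)
            kp_b[v] += through / len(paths)
    return kp_b

def betweenness_probability(adj, lam):
    """LB(v, lam): KP_B normalised so that the values add up to one."""
    kp_b = betweenness(adj, lam)
    total = sum(kp_b.values())
    return {v: value / total for v, value in kp_b.items()}
```

On a star with centre 0 and leaves 1, 2 and 3, and $\lambda = 2$, every path joins two leaves through the centre, so the centre obtains probability 1/3 and each leaf 2/9. On a complete graph all vertices obtain the same probability.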



5.3 Entropy and anonymity degree 

The graph of latencies selection of Tor nodes is defined formally as an algorithm
$\psi_{grp}(N, \delta) \sim f_{grp}(x)$ with an associated discrete random variable $X_{grp}$ and an
analytical graph $G'(V', E')$. The pmf $f_{grp}(x)$ is given by means of the $\lambda$-betweenness
probability expression:

$$f_{grp}(x_i) = p_i = P(X_{grp} = x_i) = \frac{\sum_{e,x \in V'} \sigma_{ex}(v_i, \lambda)}{\sum_{w \in V'} \sum_{e,x \in V'} \sigma_{ex}(w, \lambda)}$$

where $e$ and $x$ denote every potential entry and exit node, respectively, in a Tor circuit, and
$\lambda = \delta - 1$. It is worth noting that the value $\lambda = \delta - 1$ makes sense only if we
take into consideration that the client node $s$ and its edges are removed in the analytical graph
with respect to the latency graph. Hence, the entropy of a system whose clients use a graph of
latencies selection of nodes strategy is:

$$H_{grp}(X) = -\sum_{i=1}^{n} LB(v_i, \lambda) \cdot \log_2(LB(v_i, \lambda))$$

By replacing $H_{grp}(X)$ in Equation (2), the degree of anonymity is then:

$$d_{grp} = -\sum_{i=1}^{n} \frac{LB(v_i, \lambda)}{\log_2(n)} \cdot \log_2(LB(v_i, \lambda))$$
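
As a numerical illustration of the last expression, the degree of anonymity can be computed from any probability distribution over the $n$ nodes. The sketch assumes that Equation (2) normalises the entropy by $\log_2(n)$, which is consistent with the expression above; the function name `anonymity_degree` is hypothetical, and terms with zero probability are skipped, following the usual convention $0 \cdot \log_2(0) = 0$.

```python
import math

def anonymity_degree(probabilities):
    """Shannon entropy of the pmf divided by its maximum, log2(n)."""
    n = len(probabilities)
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return entropy / math.log2(n)
```

A uniform distribution over four nodes yields a degree of 1.0, whereas a distribution that only ever selects two of the four nodes, each with probability 0.5, yields 0.5.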

Theorem 7. Given a selection of Tor nodes $\psi_{grp}(N, \delta) \sim f_{grp}(x)$ with an associated
discrete random variable $X_{grp}$ and an analytical graph $G'(V', E')$ with $n = |V'|$ and
$m = |E'|$, the anonymity degree increases as the density of the analytical graph grows.



Proof. The density of an analytical graph $G' = (V', E')$ measures how many edges are in the set
$E'$ compared to the maximum possible number of edges between vertices in the set $V'$. Formally,
the density is given by the formula $\frac{2m}{n(n-1)}$. According to this expression, and since the
number of nodes of the analytical graph remains constant, the only way to increase the density
value is by increasing the value $m$; that is, by adding new edges to the graph. Obviously, this
implies that the more edges the analytical graph has, the higher its density value is.

Moreover, if we increase the density of the analytical graph by adding new edges, then the
$\lambda$-betweenness probability of each vertex will be affected. In particular, the denominator of
the $\lambda$-betweenness probability expression will change for all the vertices in the same manner,
whereas the numerator will be increased for those vertices that lie on any new path of length
$\lambda$ which contains some of the added edges. However, this increase is not arbitrary for a given
vertex, since it has a maximum value determined by the total number of paths of length $\lambda$
which traverse such vertex. Therefore, we can consider that each vertex has two states while we are
adding new edges. First, a transitory state, where the graph does not include all the paths of
length $\lambda$ that traverse such vertex. Second, a stationary state, which implies that the graph
has all the paths of length $\lambda$ that traverse the given vertex. Thus, if we add new edges at
random, then the numerator of the $\lambda$-betweenness probability of each vertex should be
increased uniformly. Consequently, the degree of anonymity grows when the density of the graph is
augmented.

It is interesting to highlight that the numerator of the $\lambda$-betweenness probability of a
certain vertex will be increased while it is in a transitory state, and until the vertex achieves
its stationary state. After that, such value cannot be increased. It seems obvious that the maximum
degree of anonymity associated to a particular analytical graph will be reached when all the
vertices are in a stationary state; or, in other words, when it is the complete graph. Let us
formalise this through the following theorem.

Theorem 8. Given a selection of Tor nodes $\psi_{grp}(N, \delta) \sim f_{grp}(x)$ with an associated
discrete random variable $X_{grp}$ and an analytical graph $G'(V', E')$ with $n = |V'|$, the maximum
anonymity degree is achieved iff $G'(V', E')$ is the complete graph $K_n$.



Proof. ($\Rightarrow$) Let us suppose that $G'(V', E')$ is not the complete graph $K_n$. The maximum
anonymity degree will be achieved when $LB(v_i, \lambda)$ is equiprobable for all $v_i \in V'$. That
is:

$$\frac{\sum_{e,x \in V'} \sigma_{ex}(v_i, \lambda)}{\sum_{w \in V'} \sum_{e,x \in V'} \sigma_{ex}(w, \lambda)} = \frac{1}{n} \quad \forall v_i \in V'$$

where $\lambda = \delta - 1$, and where $e$ and $x$ represent every possible entry and exit node of a
circuit, respectively. The previous expression can be rewritten as follows:

$$\sum_{e,x \in V'} \sigma_{ex}(v_i, \lambda) = \frac{\sum_{e,x \in V'} \sigma_{ex}(v_1, \lambda) + \cdots + \sum_{e,x \in V'} \sigma_{ex}(v_n, \lambda)}{n}$$

Let us now suppose that the value $\sum_{e,x \in V'} \sigma_{ex}(v_i, \lambda)$ is fixed for every node
of the analytical graph in accordance with the previous expression. Then, since $G'(V', E')$ is not
the complete graph $K_n$, we can eliminate an arbitrary edge such that the number of paths of length
$\lambda$ with entry node $e$ and exit node $x$, and which traverse a given particular node
$v_j \in V'$, is reduced. Thus, the value of $\sum_{e,x \in V'} \sigma_{ex}(v_j, \lambda)$ would be
affected for that given node. However, this contradicts the previous expression, since
$\sum_{e,x \in V'} \sigma_{ex}(v_i, \lambda)$ would take different values for distinct nodes, when such
value must be the same for any node of the graph.

($\Leftarrow$) Let us suppose, by contradiction, that the maximum anonymity degree is not achieved by
the analytical graph $K_n$ associated to $\psi_{grp}(N, \delta)$. This implies that, given two
different nodes $v_j$ and $v_k$ of the graph $K_n$, they will not have the same probability of being
chosen by $\psi_{grp}(N, \delta)$; that is, $LB(v_j, \lambda) \neq LB(v_k, \lambda)$. Then, since
$LB(v, \lambda)$ is defined as follows:

$$LB(v, \lambda) = \frac{\sum_{e,x \in V'} \sigma_{ex}(v, \lambda)}{\sum_{w \in V'} \sum_{e,x \in V'} \sigma_{ex}(w, \lambda)}$$

we can consider that the only factor which makes possible the previous restriction
$LB(v_j, \lambda) \neq LB(v_k, \lambda)$ is in the numerator, because the value of the denominator is
the same for both nodes in a fixed graph. Thus, if we want to satisfy the previous restriction, we
must change the value $\sum_{e,x \in V'} \sigma_{ex}(v, \lambda)$ of either node $v_j$ or node $v_k$.
However, this is only possible if we eliminate a particular edge of the graph. This contradicts the
imposed premise that the analytical graph associated to $\psi_{grp}(N, \delta)$ was the complete
graph $K_n$.

Theorems 7 and 8 are exemplified together in Figure 5. We can observe how a density increase of an
analytical graph influences the degree of anonymity, which achieves its maximum value when the
graph is the complete one (i.e., it has a density equal to one).

Fig. 5: Influence of the density of the analytical graph on the degree of anonymity with $|V'|$
and $\delta = 3$

Theorem 9. Let $G(V, E)$ be an undirected graph with $n = |V|$ and let $\lambda$ be a fixed length of
a path. The value of $\sigma_{st}(\lambda)$ is maximised iff $G(V, E)$ is the complete graph $K_n$.

Proof. ($\Rightarrow$) Let us suppose, by contradiction, that $G(V, E)$ is not the complete graph
$K_n$. Then, we can choose an arbitrary edge $e_{ij} \in E$ that belongs to a path of length $\lambda$
between the nodes $s$ and $t$. Then, we can remove $e_{ij}$ from $E$ since the graph is not complete.
As a consequence, the value $|KP_{st}|$ will be reduced. However, this contradicts the fact that the
value $\sigma_{st}(\lambda)$ must be maximum, since $\sigma_{st}(\lambda) = |KP_{st}|$.

($\Leftarrow$) The proof is direct, since the complete graph $K_n$ contains all the possible edges
between its nodes, and thus $KP_{st}$ consists of all the possible paths of length $\lambda$ between
the nodes $s$ and $t$.



Theorem 10. Let $K_n$ be a complete graph. The total number of paths of length $\lambda$ between any
pair of vertices $s$ and $t$ is given by the expression:

$$\sum_{s,t \in V} \sigma_{st}(\lambda) = (n-1)\left((n-1)^{\lambda} - (-1)^{\lambda}\right)$$

Proof. The proof is given in Appendix A.
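
The closed form of Theorem 10 can be checked numerically for small graphs. The sketch below assumes a particular reading of the statement that is not spelled out in the text: a path of length $\lambda$ is taken to be a sequence of $\lambda + 1$ consecutive adjacent vertices in which intermediate vertices may repeat, and the sum runs over ordered pairs of distinct vertices $s \neq t$. Under that reading, exhaustive enumeration agrees with the expression; the function names are hypothetical.

```python
from itertools import product

def count_paths_complete(n, lam):
    """Brute-force count in K_n of vertex sequences with `lam` edges
    whose consecutive vertices differ and whose endpoints differ."""
    total = 0
    for seq in product(range(n), repeat=lam + 1):
        adjacent = all(a != b for a, b in zip(seq, seq[1:]))
        if adjacent and seq[0] != seq[-1]:
            total += 1
    return total

def closed_form(n, lam):
    """Right-hand side of the expression in Theorem 10."""
    return (n - 1) * ((n - 1) ** lam - (-1) ** lam)
```

For example, with $n = 4$ and $\lambda = 3$ both functions give 84, i.e., 7 sequences for each of the 12 ordered pairs of distinct vertices.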

Theorem 11. Given a selection of Tor nodes $\psi_{grp}(N, \delta) \sim f_{grp}(x)$ with an associated
discrete random variable $X_{grp}$ and an analytical graph $G'(V', E')$, the maximum anonymity
degree is achieved iff

$$\sum_{e,x \in V'} \sigma_{ex}(\lambda) = (n-1)\left((n-1)^{\lambda} - (-1)^{\lambda}\right)$$

Proof. The proof is direct by applying Theorems 8, 9 and 10.

6 Experimental results 

We present in this section a practical implementation and evaluation of the strategies presented
above. Each implementation has undergone several tests in order to evaluate latency penalties
during Web transmissions. Additionally, the degree of anonymity of every experimental test is also
estimated, for the purpose of drawing a comparison among them.

6.1 Node distribution and configuration in PlanetLab 

In order to measure the performance of the strategies presented in our work, some practical
experiments have been conducted. In particular, we deployed a private network of Tor nodes over the
PlanetLab research network [21, 22]. Our deployed Tor network is composed of 100 nodes following a
representative distribution based on the real (public) Tor network. We distributed the nodes of
the private Tor network following the public network distribution in terms of countries and
bandwidths. Table 1 summarises the distribution values per country. The estimated bandwidths of the
nodes are retrieved through the directory servers of the real Tor network [23]. Then, we
categorised the nodes according to their bandwidths by means of the k-means clustering methodology
[24, 25]. A value of $k = 100$ is used as the number of clusters (i.e., number of selected nodes in
PlanetLab). When the algorithm converges, a cluster is assigned randomly to each node of the
private Tor network. Subsequently, the bandwidth of each node is configured with the value of its
associated centroid (i.e., the mean of the cluster). For such a purpose, the directive
BandwidthRate is used in the configuration file of every node. Let us note that the country and
bandwidth values are considered as independent in the final node distribution configuration.
Indeed, there is no need to correlate both variables, since the bandwidth of every node can be
configured by its corresponding administrator, regardless of the country which the node belongs to.

6.2 Testbed environment 

Every node of our PlanetLab private network runs the Tor software, version 0.2.3.11-alpha-dev.
Additionally, four nodes inside the network are configured as directory servers. These four nodes
are in charge of managing the global operation of the Tor network and providing the information
related to the network nodes.

Furthermore, two additional nodes outside the PlanetLab network are used in our experiments.
One of them is based on an Intel Core2 Quad Processor at 2.66GHz with 6GB of RAM and a
Gentoo GNU/Linux Operating System with a 3.2.9 kernel. This one is used as the client node, which
handles the construction of Tor circuits for every evaluated strategy. For this purpose, this node
also runs our own specific software application, hereinafter denoted as torspd.py. A beta release
of torspd.py, written in Python 2.6.6, can be downloaded at http://github.com/sercas/torspd. The
torspd.py application relies on the TorCtl Python bindings [26], a Tor controller software that
supports path building and various constraints on node and path selection, as well as statistics
gathering. Moreover, torspd.py also benefits from the package NetworkX [27] for the creation,
manipulation, and analysis of graphs. The client node is not only in charge of the circuit
construction given a certain strategy, but also of attaching an initiated HTTP connection to an
existing circuit. To accomplish this, the node uses torspd.py to connect to a special port of the
local Tor software, called the control port, which allows it to command the operations. The client
node includes an additional piece of software, also based on Python, capable of performing HTTP
queries through our private Tor network by using a SOCKS5 connection against the local Tor client.
This software, called webspd.py, is also able to obtain statistical results about the launched
queries in order to evaluate the performance of the algorithms implemented in torspd.py. Finally,
webspd.py performs every HTTP query using the IP address of the destination server directly;
consequently, any perturbation introduced by a DNS resolution is avoided in our measurements. The
second node outside the PlanetLab network is based on an Intel Xeon Processor at 2.00GHz with 2GB
of RAM and a Debian GNU/Linux Operating System with a 2.6.26 kernel. This node is considered as the
destination server, and includes an HTTP server based on Apache, version 2.2.21. The conceptual
infrastructure used to carry out our experiments is illustrated in Figure 6.

Table 1: Selected PlanetLab nodes per country according to the real Tor network distribution

Fig. 6: Conceptual representation of our testbed environment

With the purpose of obtaining results that can be extrapolated, we consider in our testbed the
outcomes reported in [28]. This report, based on the analysis of more than four billion Web pages,
provides estimations of the average size of current Internet sites, as well as the average number
of resources per page and other interesting metrics. Our testbed is built bearing in mind these
premises, so that it is close enough to a real Web environment. This way, the analysed strategies
(i.e., random selection, geographical selection, bandwidth selection, and graph of latencies
selection) are evaluated based on three different series of experiments that vary the Web page
sizes. More precisely, the client node requests via our private PlanetLab Tor network Web pages of,
respectively, 50KB, 150KB and 320KB, the last one being the average size of a Web page according to
the aforementioned report. The length of the circuits is another variable in our testbed. More
precisely, the different strategies are evaluated with Tor circuits of length three, four, five and
six. Every experiment is repeated 100 times, from which we obtain the minimum, maximum and
average time needed to download the corresponding Web pages. Likewise, the standard deviation is
computed for every test. The obtained numerical results are presented in Tables 2, 3, 4 and 5, and
also depicted graphically in Figure 7. In the sequel, we use these results to analyse the
performance of every strategy in terms of transmission times and degree of anonymity.
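
The four statistics reported per experiment can be obtained from the list of measured download times as sketched below. The function name `summarise` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state whether the population or the sample formula was used.

```python
import statistics

def summarise(times):
    """Minimum, maximum, mean and standard deviation of download times."""
    return {
        "min": min(times),
        "max": max(times),
        "avg": statistics.mean(times),
        # Assumption: population standard deviation.
        "std": statistics.pstdev(times),
    }
```

Replacing `statistics.pstdev` with `statistics.stdev` would give the sample standard deviation instead; with 100 repetitions the two differ only slightly.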
Fig. 7: Experimental results

6.3 Random selection of nodes strategy evaluation

As previously stated in Theorem 2, the random selection of nodes strategy is the best one from
the point of view of the degree of anonymity, since it achieves the maximum possible value.
Nevertheless, this selection methodology suffers from a high penalty in terms of latency, according
to the results of our evaluation. As can be inferred from the numerical outcomes, and as reflected
in Figure 7, the random selection algorithm exhibits the worst average transmission times in
practically every combination of site size and circuit length (the only exception in Tables 2 and 4
is the 50KB page with $\delta = 3$, where the bandwidth strategy is marginally slower). This can be
explained by the random nature of this strategy. Indeed, by selecting the nodes at random, the
strategy can run into problems which directly affect the latency of a computed circuit, such as a
large distance between the involved nodes (in terms of countries, i.e., routers), network
congestion in a part of the circuit [29], or a selection of nodes with limited computational
resources, among others. It is clear that all these drawbacks are hidden from the strategy and
explain the obtained results. Moreover, all these problems are reflected in the standard deviation
of the measurements, which is, in general, the highest among the evaluated alternatives.



$\psi_{rnd}(N, \delta)$, $d_{rnd} = 1.0$, Web size 50KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    0.95094203949     3.38077807426     1.84956678152     0.58107725003
$\delta = 4$    1.14792490005     7.46992301941     2.56735023022     1.03927644851
$\delta = 5$    1.13161778450     12.7252390385     3.13187572718     1.69722167190
$\delta = 6$    1.57145905495     14.6901309490     3.56973065615     2.06960596616

$\psi_{rnd}(N, \delta)$, $d_{rnd} = 1.0$, Web size 150KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    0.970992088318    5.70451307297     2.46016269684     0.901269612931
$\delta = 4$    1.081045866010    12.0326070786     3.34545367479     1.478886535440
$\delta = 5$    1.624027013780    16.0551090240     3.78437126398     1.918732505410
$\delta = 6$    2.279263019560    11.5805990696     4.71352141102     2.544477101520

$\psi_{rnd}(N, \delta)$, $d_{rnd} = 1.0$, Web size 320KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.49153804779     13.2033219337     3.79921305656     2.45165379541
$\delta = 4$    1.84271001816     15.2616338730     4.98011079788     2.67792560196
$\delta = 5$    1.73619008064     17.1969499588     5.37626729012     3.01781647919
$\delta = 6$    2.16737580299     17.8402540684     6.37420113325     3.27889183837

Table 2: Random selection of nodes strategy ($\psi_{rnd}$) results



6.4 Geographical selection of nodes strategy evaluation 

The evaluation of the geographical selection of nodes strategy has been performed by fixing the
country and taking into consideration the node distribution detailed in Table 1. The United States
was selected, since it is the country where the client node resides. Therefore, we can calculate
the anonymity degree for this strategy by recalling its related expression introduced in Section
3.2:

$$d_{geo} = \frac{\log_2(m)}{\log_2(n)} = \frac{\log_2(27)}{\log_2(100)} \approx 0.7157$$
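
The value above can be reproduced with a couple of lines of Python. The function name `geo_anonymity_degree` is hypothetical; $m = 27$ is the number of nodes in the selected country and $n = 100$ the total number of nodes of the private network, as used in the expression.

```python
import math

def geo_anonymity_degree(m, n):
    """Degree of anonymity when only m of the n nodes are eligible."""
    return math.log2(m) / math.log2(n)

print(round(geo_anonymity_degree(27, 100), 4))  # 0.7157
```

The same function shows how strongly the result depends on the chosen country: a country hosting fewer nodes gives a smaller $m$ and therefore a lower degree of anonymity.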



$\psi_{geo}(N, \delta)$, $d_{geo} \approx 0.7157$, Web size 50KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    0.913872003555    2.36748099327     1.31694087505     0.219359721740
$\delta = 4$    1.083739995960    2.03739213943     1.49165359974     0.189194613865
$\delta = 5$    1.157481908800    2.17184281349     1.56993633509     0.220167861127
$\delta = 6$    1.200492858890    2.63958501816     1.71368015051     0.234977785757

$\psi_{geo}(N, \delta)$, $d_{geo} \approx 0.7157$, Web size 150KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.38168692589     2.68786311150     1.79467165947     0.260276001481
$\delta = 4$    1.27939105034     2.92536497116     1.87463890314     0.281488220772
$\delta = 5$    1.33843898773     3.71059083939     1.98130603790     0.318113252410
$\delta = 6$    1.40922594070     3.28039193153     2.05482839346     0.261217096578

$\psi_{geo}(N, \delta)$, $d_{geo} \approx 0.7157$, Web size 320KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.41799902916     2.93465995789     2.20432470083     0.310828573513
$\delta = 4$    1.54156398773     3.33606600761     2.37035997391     0.329438846284
$\delta = 5$    1.88031601906     4.10431504250     2.51430423737     0.370494801277
$\delta = 6$    1.64570999146     3.89323496819     2.70262962818     0.376313686885

Table 3: Geographical selection strategy ($\psi_{geo}$) results



$\psi_{bw}(N, \delta)$, $d_{bw} \approx 0.9009$, Web size 50KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    0.964261054993    5.12318110466     1.86709306002     0.789060168081
$\delta = 4$    1.078310012820    5.41474699974     2.36407416582     0.859666129425
$\delta = 5$    1.060457944870    6.92380499840     2.63418945789     1.128347022810
$\delta = 6$    1.278292894360    12.7536408901     3.03272451162     1.882337407440

$\psi_{bw}(N, \delta)$, $d_{bw} \approx 0.9009$, Web size 150KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.26475811005     7.09091401100     2.28234255314     0.765374484505
$\delta = 4$    1.23797798157     6.80870413780     2.91089500189     0.947719280103
$\delta = 5$    1.45632719994     12.6443610191     2.97445464373     1.431690789930
$\delta = 6$    1.27809882164     12.7246098518     3.19875429869     1.666334473980

$\psi_{bw}(N, \delta)$, $d_{bw} \approx 0.9009$, Web size 320KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.49932813644     12.9250459671     3.29500451326     1.79104222251
$\delta = 4$    1.52931094170     13.7227480412     3.70603173733     1.90767488259
$\delta = 5$    1.66296601295     17.3828690052     4.07738301039     2.18405609668
$\delta = 6$    2.04065585136     20.1761889458     4.32070047140     2.68160673888

Table 4: Bandwidth selection strategy ($\psi_{bw}$) results






$\psi_{grp}(N, \delta)$, $d_{grp}$ (cf. Section 6.6), Web size 50KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    0.935021877289    3.61296200752     1.59488223791     0.545028374794
$\delta = 4$    0.998504877090    3.74897003174     1.77225045919     0.548956074123
$\delta = 5$    1.195134162900    4.21774697304     2.02931211710     0.576776679346
$\delta = 6$    1.267808914180    3.35924196243     2.18245174408     0.502899662482

$\psi_{grp}(N, \delta)$, $d_{grp}$ (cf. Section 6.6), Web size 150KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.112107038500    5.53429508209     2.04227621531     0.790901626275
$\delta = 4$    1.290552854540    5.68215894699     2.66674958944     0.944641197284
$\delta = 5$    1.163586854930    7.41387891769     2.68937173843     0.917799034111
$\delta = 6$    1.550453186040    5.40707683563     3.00299987316     0.935654846647

$\psi_{grp}(N, \delta)$, $d_{grp}$ (cf. Section 6.6), Web size 320KB

Circ. length    Min.              Max.              Avg.              Std. dev.
$\delta = 3$    1.502956867220    7.29033994675     2.51847231626     1.009688576850
$\delta = 4$    1.498482227330    6.52234792709     3.22330027342     1.061893260420
$\delta = 5$    1.734797000890    6.73247194290     3.31047295094     0.940285391625
$\delta = 6$    1.689666986470    7.89933013916     3.46063615084     1.094579395080

Table 5: Graph of latencies selection strategy ($\psi_{grp}$) results



As we can observe, the degree of anonymity has dropped significantly when we compare it with the
results of the other strategies. However, sacrificing a certain level of anonymity results in a
drastic reduction of the time needed to download a Web page, as can be noticed if we compare
Figures 7a, 7b and 7c. In fact, this selection methodology provides the best performance among all
the alternatives in terms of the time required to download a Web page. It is also interesting to
remark that the standard deviation of the time measured with this method remains nearly constant
regardless of the circuit length and the size of the Web page. This seems reasonable, since the
geographically closer the nodes are, the less random interference affects the overall latency. We
can understand this if we think in terms of the number of network elements (i.e., routers,
switches, etc.) involved in the TCP/IP routing process between every pair of nodes. Thus, a pair of
nodes which belong to the same country will be interconnected through fewer network elements than
two nodes which belong to different countries and, as a consequence, the latency will be more
stable over time. This can be an interesting fact, since the penalty introduced by the use of Tor
has a smaller effect on the psychological perception of the user when browsing the Web [5].
Nevertheless, the anonymity degree of this strategy is strongly tied to the fixed country since, as
we pointed out in Theorem 4, the fewer nodes belong to the country, the lower the anonymity degree
provided.

6.5 Bandwidth selection of nodes strategy evaluation 

The anonymity degree of the bandwidth selection of nodes strategy has been computed empirically
according to its associated formula (cf. Section 3.3 for details). In particular, the torspd.py
application was in charge of obtaining the bandwidth of every node of our private Tor network
and of calculating the anonymity degree. The anonymity degree at the time this strategy was
evaluated was approximately 0.9009. It is important to highlight that, in spite of the
fixed bandwidth specified in the configuration, the bandwidth of every onion router is estimated
periodically by the Tor software running at every node, and provided later to torspd.py through
the directory servers. Indeed, since the bandwidth established for a node through its
configuration does not necessarily correspond to the real value, the anonymity degree of this
strategy can change over time, unlike that of the previous strategies.
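
As an illustration of how such a degree can be derived from measured bandwidths, the following
minimal sketch computes a normalised Shannon entropy. It is only a sketch under stated
assumptions: the exact formula is the one given in Section 3.3 and is not reproduced here, we
assume that each node is selected with probability proportional to its bandwidth, and the function
name and the sample values are hypothetical (they are not taken from torspd.py).

```python
import math


def bandwidth_anonymity_degree(bandwidths):
    """Normalised Shannon entropy of a bandwidth-weighted node choice.

    Assumption (not taken from the paper): node i is selected with
    probability bw_i / sum(bw), and the degree is H(X) / log2(N).
    """
    if len(bandwidths) < 2:
        raise ValueError("at least two nodes are needed")
    total = sum(bandwidths)
    # Nodes with zero bandwidth are never selected and add no entropy.
    probs = [bw / total for bw in bandwidths if bw > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(bandwidths))


# Hypothetical bandwidth values, for illustration only.
print(bandwidth_anonymity_degree([100, 100, 100, 100]))  # uniform case
print(bandwidth_anonymity_degree([400, 100, 50, 50]))    # skewed case
```

Under these assumptions, equal bandwidths give the maximum degree, and any imbalance in the
measured bandwidths lowers it, which is consistent with the degree varying as the bandwidth
estimates are refreshed.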

From the viewpoint of the latency results, we can observe how the bandwidth selection of
nodes strategy improves the values with respect to the random strategy by sacrificing some degree
of anonymity. However, it does not achieve the transmission times of the geographical methodology.
The reason is that this strategy does not take into account important networking aspects, such as
network congestion, number of routers, etc., that also impact the transmission times. It is
therefore reasonable that this methodology is more susceptible to networking problems, which
results in higher transmission times. This is also corroborated by the standard deviation results,
which reveal a lack of stability. In fact, the transmission times increase as the size of the Web
page or the length of the circuit increases.

6.6 Graph of latencies strategy evaluation 

The experimental evaluation of our proposal was performed after setting the parameters of its
related algorithms. In particular, they were $\Delta t = 5$, $m = 3$, $k = 300$ and
$max\_iter = 5$. Furthermore, the Latency Computation Process was launched two hours before
the execution of webspd.py, leading to an analytical graph with more than 3,000 edges, which
corresponds to a density of approximately 0.67. At that moment, torspd.py estimated the degree
of anonymity in accordance with the formula presented in Section 5.3. Since this equation depends
on the length of the circuit, the anonymity degree was estimated for lengths 3, 4, 5 and 6, giving
0.9987, 0.9984, 0.9982 and 0.9981, respectively. As with the previous strategy, the degree of
anonymity is dynamic over time, and in this case depends on the connectivity of the analytical
graph. Nevertheless, the anonymity degree was not estimated again during the evaluation tests.

Function ct was implemented by means of the construction of random circuits of length $m$.
Such circuits are not used as anonymous channels for Web transmissions, but to estimate the
latencies of the edges. This is possible since, during the construction of a circuit, the Latency
Computation Process is notified every time a new node is added to the circuit. Hence, the latency
of an edge can be determined by subtracting the time instants at which two consecutive nodes were
added to a given circuit. Regarding this way of measuring the latencies, two aspects are worth
highlighting. The first is that it meets the restriction of estimating the latencies secretly; the
second is that it measures not only the latencies due to the network itself, but also the delays
caused by the status of the nodes or their resource limitations. In this way, our proposal
indirectly models some negative effects that the other strategies do not reflect, leading to an
improvement of the transmission times, as the obtained results show.
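
The subtraction of consecutive time instants described above can be sketched as follows. This is
a minimal illustration rather than the code of the Latency Computation Process: the event format
(a list of node identifier and timestamp pairs, in the order in which the nodes were added), the
function name and the sample values are all assumptions made for the example.

```python
def edge_latencies(extension_events):
    """Estimate one latency per edge of a single measurement circuit.

    extension_events: list of (node_id, timestamp_in_seconds) tuples, in
    the order in which the nodes were added to the circuit (hypothetical
    format, assumed for this sketch).
    Returns a dict mapping (previous_node, node) to the elapsed time
    between the two notifications.
    """
    latencies = {}
    pairs = zip(extension_events, extension_events[1:])
    for (prev_node, prev_time), (node, time) in pairs:
        latencies[(prev_node, node)] = time - prev_time
    return latencies


# Hypothetical notifications for a circuit of three nodes.
events = [("relayA", 10.00), ("relayB", 10.25), ("relayC", 10.75)]
for edge, latency in edge_latencies(events).items():
    print(edge, latency)
```

Since each difference covers everything that happens between two notifications, the estimate
includes processing delays at the nodes as well as network delays, which matches the second
aspect highlighted above.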

By comparing the results of the previous strategies with the current one, we can observe that
our new proposal exhibits a better trade-off between degree of anonymity and transmission latency.
In particular, from the perspective of the transmission times, our proposal is quite close to
the geographical selection strategy, while providing a higher degree of anonymity. Indeed, from
the anonymity point of view, only the random selection of nodes strategy outperforms our new
strategy but, as already mentioned, at the cost of considerably worse transmission times.

7 Related Work 

The use of entropy-based metrics to measure the anonymity degree of infrastructures like Tor
was simultaneously established by Diaz et al. [11] and Serjantov and Danezis [12]. Since then,
several other authors have proposed alternative measures [30]. Examples include the use of the min
entropy by Shmatikov and Wang in [31], and the Rényi entropy by Clauß and Schiffner in [32].
Other examples include the use of combinatorial measures by Edman et al. [33], later improved
by Troncoso et al. in [34]. Snader and Borisov proposed in [35] the use of the Gini coefficient as
a way to measure inequalities in the circuit selection process of Tor. Murdoch and Watson propose
in [36] to assess the bandwidth available to the adversary, and its effect on the security of
several path selection techniques.

With regard to the literature on selection algorithms as a way to improve the anonymity degree
while also increasing performance, several strategies have been reported. Examples include the use
of reputation-based strategies [37], opportunistic weighted network heuristics [35,38], game
theory [39], and system awareness [40]. Compared to those previous efforts, which mainly aim
at reducing overhead via bandwidth measurements while addressing the classical threat model of
Tor [7], our approach takes advantage of latency measurements in order to best balance anonymity
and performance. Indeed, given that bandwidth is simply self-reported on Tor, regular nodes may
be misled, and their security compromised, if nodes are allowed to use fraudulent bandwidth
reports during the construction of Tor circuits [37,41].

The use of latency-based measurements for path selection on anonymous infrastructures has
been previously reported in the literature. In [42], Sherr et al. propose a link-based path
selection strategy for onion routing, whose main criterion relies, in addition to bandwidth
measures, on network link characteristics such as latency, jitter, and loss rates. This way, a
false perception of nodes with high bandwidth capacities is avoided, given that low-latency nodes
are now discovered rather than self-advertised. Similarly, Panchenko and Renner [43] propose to
complement bandwidth measurements with round trip time during the construction of Tor circuits.
Their work is complemented by practical evaluations over the real Tor network, which demonstrate
the performance improvement that such latency-based strategies achieve. Finally, Wang et al.
[44, 45] propose the use of latency in order to detect and prevent congested nodes, so that nodes
using the Tor infrastructure avoid routing their traffic over congested paths. In contrast to
these proposals, our work aims at providing a defence mechanism. Our latency-based approach is
considered from a node-centred perspective, rather than as a network-based property used to
balance transmission delays. This way, adversarial nodes are prevented from increasing their
chances of relaying traffic by simply presenting themselves as low-latency nodes, while an optimal
propagation rate is guaranteed by the remaining nodes of the system.



8 Conclusion 



We addressed in this paper the influence of circuit construction strategies on the anonymity degree
of the Tor (The onion router) anonymity infrastructure. We evaluated three classical strategies with
respect to their de-anonymisation risk and latency, and regarding their performance for anonymising
Internet traffic. We then presented the construction of a new circuit selection algorithm that
considerably reduces the success probability of linking attacks while providing enough performance
for low-latency services. Our experimental results, conducted on a real-world Tor deployment over
PlanetLab, confirm the validity of the new strategy and show that it outperforms the classical
ones.

A Number of walks of length $\lambda$ between any two distinct vertices of a $K_n$ graph

Let $K_n$ be a complete graph with $n$ vertices and $\frac{n(n-1)}{2}$ edges, such that every pair of
distinct vertices is connected by a unique edge. Then, a walk in $K_n$ of length $\lambda$ from vertex
$v_1$ to vertex $v_{\lambda+1}$ corresponds to the following sequence:

$$\underbrace{v_1 \xrightarrow{e_1} v_2 \xrightarrow{e_2} v_3 \xrightarrow{e_3} v_4 \xrightarrow{e_4} \cdots \xrightarrow{e_{\lambda-1}} v_\lambda \xrightarrow{e_\lambda} v_{\lambda+1}}_{\text{walk in } K_n \text{ of length } \lambda}$$

such that each $v_i$ is a vertex of $K_n$, each $e_j$ is an edge of $K_n$, and the vertices connected
by $e_j$ are $v_j$ and $v_{j+1}$.

Let $A$ be the adjacency matrix of $K_n$, such that $A$ is an $n$-square binary matrix in which each
entry is either zero or one, i.e., every $(i, j)$-entry in $A$ is equal to the number of edges
incident to $v_i$ and $v_j$. Moreover, $A$ is symmetric and circulant [46]. It always has zeros on
the leading diagonal and ones off the leading diagonal. For example, the adjacency matrix of a
complete graph $K_4$ is always equal to:

$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

The total number of possible walks of length $\lambda$ from vertex $v_i$ to vertex $v_j$ is the
$(i, j)$-entry of $A^\lambda$, i.e., the matrix product, denoted by $(\cdot)$, of $\lambda$ copies of
$A$ [47]. Following the above example, the number of walks of length 2 between any two distinct
vertices can be obtained directly from $A^2$, such that



$$A^2 = A \cdot A$$

which leads to

$$A^2 = \begin{pmatrix} 3 & 2 & 2 & 2 \\ 2 & 3 & 2 & 2 \\ 2 & 2 & 3 & 2 \\ 2 & 2 & 2 & 3 \end{pmatrix}$$

Note that any $(i, j)$-entry of $A^2$ (where $i \neq j$) gives the same number of walks of length 2
from any vertex $v_i$ to any distinct vertex $v_j$. The total number of walks of length 2 between
any two distinct vertices can, thus, be obtained by adding up the values of every $(i, j)$-entry
off the leading diagonal of matrix $A^2$. In the above example, it suffices to sum $4(4 - 1)$ times
(i.e., twice the number of edges in $K_4$) the value 2 that any $(i, j)$-entry (where $i \neq j$) has
in $A^2$. This amounts to having exactly 24 possible walks on any $K_4$ graph.
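
The $K_4$ example can be checked mechanically. The following sketch builds the adjacency matrix
of a complete graph, raises it to a given power with plain list-based matrix products, and sums
the entries off the leading diagonal. The helper names are ours and are introduced only for this
illustration.

```python
def complete_graph_adjacency(n):
    """Adjacency matrix of K_n: zeros on the diagonal, ones elsewhere."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]


def mat_pow(a, power):
    """Product of `power` copies of a (power >= 1)."""
    result = a
    for _ in range(power - 1):
        result = mat_mul(result, a)
    return result


def walks_between_distinct_vertices(n, length):
    """Sum of the off-diagonal entries of A^length for K_n."""
    a_pow = mat_pow(complete_graph_adjacency(n), length)
    return sum(a_pow[i][j] for i in range(n) for j in range(n) if i != j)


print(mat_pow(complete_graph_adjacency(4), 2)[0])  # [3, 2, 2, 2]
print(walks_between_distinct_vertices(4, 2))       # 24
```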

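Therefore, the problem of finding the number of walks of length $\lambda$ between any two distinct
vertices of a $K_n$ graph reduces to finding the $(i, j)$-entry of $A^\lambda$, where $i \neq j$.
Indeed, let $a^\lambda_{i,j}$ be the $(i, j)$-entry of $A^\lambda$. Then, the recurrence relation
between the original adjacency matrix $A$ and the matrix product of up to $\lambda - 1$ copies of
$A$, i.e.,

$$A^\lambda = A^{\lambda-1} \cdot A \qquad (3)$$

with initial conditions:

$$a^2_{i,j} = \begin{cases} (n-2) & \text{if } i \neq j \\ (n-1) & \text{if } i = j \end{cases} \qquad a^1_{i,j} = \begin{cases} 1 & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$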

is sufficient to solve the problem. Notice, moreover, that the result does not depend on any precise
value of either $i$ or $j$. Indeed, it is proved in [47] that there is a constant relationship
between the $(i, j)$-entries off the leading diagonal of $A^\lambda$ and the $(i, i)$-entries on the
leading diagonal of $A^\lambda$. More precisely, let $t^\lambda$ be any $(i, j)$-entry off the leading
diagonal of $A^\lambda$ (i.e., $t^\lambda = a^\lambda_{i,j}$ such that $i \neq j$). Let $d^\lambda$ be any
$(i, i)$-entry on the leading diagonal of $A^\lambda$ (i.e., $d^\lambda = a^\lambda_{i,i}$). Then, if we
subtract $t^\lambda$ from $d^\lambda$, the result is always equal to $(-1)^\lambda$. In other words, if we
express $A^\lambda$ as follows:

$$A^\lambda = \left(a^\lambda_{i,j}\right), \qquad a^\lambda_{i,j} = \begin{cases} t^\lambda & \text{if } i \neq j \\ d^\lambda & \text{if } i = j \end{cases}$$

then $t^\lambda = d^\lambda - (-1)^\lambda$. We can now use the recurrence relation shown in
Equation (3) to derive the following two results:

$$t^\lambda = (n-2)\, t^{\lambda-1} + d^{\lambda-1} \qquad (4)$$

$$d^\lambda = (n-1)\, t^{\lambda-1} \qquad (5)$$



with the initial conditions $t^1 = 1$ and $d^1 = 0$.

Cumbersome, but elementary, transformations shown in both [46] and [47] lead us to unfold
the two recurrence relations in Equations (4) and (5) into the following two self-contained
expressions:

$$t^\lambda = \frac{(n-1)^\lambda - (-1)^\lambda}{n} \qquad (6)$$

$$d^\lambda = \frac{(n-1)^\lambda + (n-1)(-1)^\lambda}{n} \qquad (7)$$
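
For the reader's convenience, one possible way of unfolding the recurrences is sketched below. It
is not necessarily the route followed in [46] and [47]; the constants $\alpha$ and $\beta$ are
introduced only for this derivation.

```latex
% Substituting d^{\lambda-1} = (n-1) t^{\lambda-2} (Equation (5)) into Equation (4):
t^{\lambda} = (n-2)\, t^{\lambda-1} + (n-1)\, t^{\lambda-2}, \qquad \lambda \geq 3 .

% The characteristic polynomial factorises as
x^{2} - (n-2)\,x - (n-1) = \bigl(x - (n-1)\bigr)(x + 1),

% so that t^{\lambda} = \alpha (n-1)^{\lambda} + \beta (-1)^{\lambda}.
% The initial values t^{1} = 1 and t^{2} = (n-2)\,t^{1} + d^{1} = n-2 give
\alpha (n-1) - \beta = 1, \qquad \alpha (n-1)^{2} + \beta = n-2
\;\Longrightarrow\; \alpha = \frac{1}{n}, \quad \beta = -\frac{1}{n},

% hence
t^{\lambda} = \frac{(n-1)^{\lambda} - (-1)^{\lambda}}{n}, \qquad
d^{\lambda} = (n-1)\, t^{\lambda-1}
            = \frac{(n-1)^{\lambda} + (n-1)(-1)^{\lambda}}{n} .
```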

To conclude, we can now use Equations (6) and (7) to express the total number of non-closed and
closed walks in the complete graph $K_n$. From Equation (6) we have the value of any $(i, j)$-entry
in $A^\lambda$ such that $i \neq j$. As we did previously in the example of the complete graph $K_4$,
the total number of walks of length $\lambda$ between any two distinct vertices can be obtained by
adding up, $n(n - 1)$ times (i.e., twice the number of edges in the graph), the value of any
$(i, j)$-entry off the leading diagonal of matrix $A^\lambda$. This amounts to having exactly
$n(n - 1) \cdot t^\lambda$ which, after simplifying, leads to:

$$(n-1)\left((n-1)^\lambda - (-1)^\lambda\right) \qquad (8)$$

possible walks of length $\lambda$ on any $K_n$ graph.
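
As a sanity check of Equation (8), the following sketch compares three independent computations
for small graphs: the iteration of the recurrences in Equations (4) and (5), the closed form of
Equation (8), and an exhaustive enumeration of vertex sequences. The function names are ours and
the chosen ranges of $n$ and $\lambda$ are arbitrary.

```python
from itertools import product


def walks_via_recurrence(n, length):
    """Iterate Equations (4) and (5) from t^1 = 1, d^1 = 0; return (t, d)."""
    t, d = 1, 0
    for _ in range(length - 1):
        # Both right-hand sides use the values of the previous step.
        t, d = (n - 2) * t + d, (n - 1) * t
    return t, d


def walks_closed_form(n, length):
    """Equation (8): walks of the given length between distinct vertices."""
    return (n - 1) * ((n - 1) ** length - (-1) ** length)


def walks_brute_force(n, length):
    """Count sequences of length + 1 vertices of K_n in which consecutive
    vertices differ (an edge exists) and the two endpoints differ."""
    count = 0
    for seq in product(range(n), repeat=length + 1):
        if seq[0] != seq[-1] and all(a != b for a, b in zip(seq, seq[1:])):
            count += 1
    return count


for n in (3, 4, 5):
    for length in (1, 2, 3, 4):
        t, d = walks_via_recurrence(n, length)
        assert n * (n - 1) * t == walks_closed_form(n, length)
        assert walks_closed_form(n, length) == walks_brute_force(n, length)
        assert d - t == (-1) ** length

print(walks_closed_form(4, 2))  # 24, as in the K_4 example
```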



